1 Introduction

Here we are going to analyse the Experimental packages of Bioconductor. See the home of the analysis here.

2 Load data

First we read the latest data from the Bioconductor project. There are two files: one with the download stats from 2009 until today, and another with the download stats of the software packages. We will only use the first one:

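library("data.table") # Fast subsetting and grouping of the stats table
library("ggplot2")    # Plots
load("stats.RData")   # Loads the data.table `stats`
stats <- stats[Category == "Experimental", ] # Keep only Experimental packages
stats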
##                     Package Year Month Nb_of_distinct_IPs Nb_of_downloads
##     1:              SNAData 2017    02                  1               1
##     2:              SNAData 2017    04                  2               4
##     3:              SNAData 2017    05                  2               3
##     4:              SNAData 2016    02                  1               1
##     5:              SNAData 2016    03                  6               6
##     6:              SNAData 2016    04                  4               5
##     7:              SNAData 2016    05                  9               9
##     8:              SNAData 2016    09                  1               1
##     9:              SNAData 2016    11                  2               3
##    10:              SNAData 2015    09                  3               6
##    ---                                                                   
## 17793:          biotmleData 2017    04                 24              42
## 17794:          biotmleData 2017    05                 47              55
## 17795:          biotmleData 2017    06                 32              38
## 17796:          microRNAome 2017    06                  1               1
## 17797: sampleClassifierData 2017    01                  5               5
## 17798: sampleClassifierData 2017    02                  5               5
## 17799: sampleClassifierData 2017    03                  6               9
## 17800: sampleClassifierData 2017    04                  8              31
## 17801: sampleClassifierData 2017    05                 12              15
## 17802: sampleClassifierData 2017    06                 12              16
##            Category                Date
##     1: Experimental 2017-02-01 01:00:00
##     2: Experimental 2017-04-01 02:00:00
##     3: Experimental 2017-05-01 02:00:00
##     4: Experimental 2016-02-01 01:00:00
##     5: Experimental 2016-03-01 01:00:00
##     6: Experimental 2016-04-01 02:00:00
##     7: Experimental 2016-05-01 02:00:00
##     8: Experimental 2016-09-01 02:00:00
##     9: Experimental 2016-11-01 01:00:00
##    10: Experimental 2015-09-01 02:00:00
##    ---                                 
## 17793: Experimental 2017-04-01 02:00:00
## 17794: Experimental 2017-05-01 02:00:00
## 17795: Experimental 2017-06-01 02:00:00
## 17796: Experimental 2017-06-01 02:00:00
## 17797: Experimental 2017-01-01 01:00:00
## 17798: Experimental 2017-02-01 01:00:00
## 17799: Experimental 2017-03-01 01:00:00
## 17800: Experimental 2017-04-01 02:00:00
## 17801: Experimental 2017-05-01 02:00:00
## 17802: Experimental 2017-06-01 02:00:00

There have been 340 Experimental packages in Bioconductor. Some were added recently and others earlier.
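The package count can be derived from the table itself. The following is a minimal sketch in base R on a small invented table (the `toy` data frame and its values are assumptions for illustration, not the real `stats` data):

```r
# Toy stand-in for the `stats` table; the values are invented for illustration.
toy <- data.frame(
  Package = c("pkgA", "pkgA", "pkgB", "pkgC", "pkgC", "pkgC"),
  Year    = c(2016, 2017, 2017, 2015, 2016, 2017),
  stringsAsFactors = FALSE
)

# Number of distinct packages that appear at least once.
n_packages <- length(unique(toy$Package))

# First year in which each package appears.
first_year <- tapply(toy$Year, toy$Package, min)

print(n_packages)
print(first_year)
```

On the real data.table, `stats[, uniqueN(Package)]` should give the equivalent count.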

3 Packages

3.1 Number

First we explore the number of packages being downloaded by month:

theme_bw <- theme_bw(base_size = 16)
scal <- scale_x_datetime(date_breaks = "3 months")
ggplot(stats[, .(Downloads = .N), by = Date], aes(Date, Downloads)) +
  geom_bar(stat = "identity") + 
  theme_bw +
  ggtitle("Packages downloaded") +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) + 
  scal + 
  xlab("")

Figure 1: Packages in Bioconductor with downloads

The number of packages being downloaded increases almost exponentially with time. This is partially explained by the incorporation of new packages.

ggplot(stats[, .(Number = sum(Nb_of_downloads)), by = Date], aes(Date, Number)) +
  geom_bar(stat = "identity") + 
  theme_bw +
  ggtitle("Downloads") +
  scal +
  theme(axis.text.x=element_text(angle=60, hjust=1)) + 
  xlab("")

Figure 2: Downloads of packages

Even though the number of packages increases almost exponentially, the number of downloads has grown linearly with time since 2011. This indicates that each package must compete with an ever larger number of packages to be downloaded.

pd <- position_dodge(0.1)
ggplot(stats[, .(Number = mean(Nb_of_downloads), 
                  ymin = mean(Nb_of_downloads)-1.96*sd(Nb_of_downloads)/sqrt(.N),
                  ymax = mean(Nb_of_downloads)+1.96*sd(Nb_of_downloads)/sqrt(.N)), 
              by = Date], aes(Date, Number)) +
  geom_errorbar(aes(ymin = ymin, ymax = ymax), width=.1, position=pd) +
  geom_point() + 
  geom_line() +
  theme_bw +
  ggtitle("Downloads") +
  ylab("Mean download for a package") +
  scal +
  theme(axis.text.x=element_text(angle=60, hjust=1)) + 
  xlab("")

Figure 3: Downloads of packages per package
The error bar indicates the 95% confidence interval.

Here we can appreciate that the number of downloads per package hasn’t changed much with time. If anything, there is now more dispersion between the downloads of different packages.
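The error bars in Figure 3 come from the normal approximation to the 95% confidence interval of the mean, that is, mean ± 1.96·sd/√n. As a small sketch (the helper name `ci95` and the numbers are hypothetical), the same calculation can be written as a function:

```r
# Hypothetical helper: normal-approximation 95% confidence interval of the mean.
ci95 <- function(x) {
  n <- length(x)
  m <- mean(x)
  half_width <- 1.96 * sd(x) / sqrt(n)  # same formula as in the plot code
  c(mean = m, ymin = m - half_width, ymax = m + half_width)
}

# Invented download counts of four packages in one month.
downloads <- c(2, 4, 6, 8)
res <- ci95(downloads)
print(res)
```

Note that `sd()` returns `NA` for a single observation, so a month with only one package would get no error bar.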

3.2 Incorporations

This might be due to an increase in the usage of packages, or to new packages bringing in more users. We start by finding out how many packages have been introduced in Bioconductor each month.

today <- Sys.Date()
year <- format(today, "%Y")   # Current year as text, e.g. "2017"
month <- format(today, "%m")  # Current month as two digits, like the Month column
# First month in which each package was downloaded
incorporation <- stats[, .SD[which.min(Date)], by = Package, .SDcols = "Date"]
histincorporation <- incorporation[, .(Number = .N), by = Date]
ggplot(histincorporation, aes(Date, Number)) + 
  geom_bar(stat="identity") + 
  theme_bw + 
  ggtitle("Packages with first download") +
  scal +
  theme(axis.text.x=element_text(angle=60, hjust=1)) +
  xlab("")

Figure 4: New packages

We can see that there were more than 1500 packages in Bioconductor before 2009, and since then there is occasionally a rise to 500 new downloads (which would be new packages being added).

3.3 Removed

Using a similar procedure we can approximate the packages deprecated and removed each month. In this case we look for the last date a package was downloaded, excluding the current month:

deprecation <- stats[, .SD[which.max(Date)], by = Package, .SDcols = c("Date",  "Year", "Month")]
# Drop packages last seen in the current month of the current year
deprecation <- deprecation[!(Month == month & Year == year), ]
histDeprecation <- deprecation[, .(Number = .N), by = Date]
ggplot(histDeprecation, aes(Date, Number)) + 
  geom_bar(stat = "identity") + 
  theme_bw + 
  ggtitle("Packages without downloads") +
  scal +
  theme(axis.text.x=element_text(angle=60, hjust=1)) + 
  ylab("Last seen packages") +
  xlab("")

Figure 5: Date where a package was last downloaded
Approximates to the date when packages were removed from Bioconductor.

Here we can see the packages whose last download was in a certain month, assuming that this means they are deprecated. It can happen that a package is no longer downloaded but is still in the Bioconductor repository; this would be the reason for the spike to 3000 packages in the last month. In total there are 326 packages downloaded. We further explore how much time passes between the incorporation of a package and its last download.

df <- merge(incorporation, deprecation, by = "Package")
# Time between first and last download, with explicit units
timeBioconductor <- as.numeric(difftime(df$Date.y, df$Date.x, units = "hours")) # Transform to hours
hist(timeBioconductor, main = "Time in Bioconductor", xlab = "Hours")
abline(v = mean(timeBioconductor), col = "red")      # Mean
abline(v = median(timeBioconductor), col = "green")  # Median
Figure: Time of packages between first and last download

The mean time of a package in Bioconductor is…. Not surprisingly, the number of packages incorporated before 2009 and still in the repository is 0. But how do the packages that have not been removed do in Bioconductor?
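When two date-times are subtracted in R, the result is a `difftime` whose units are chosen automatically, so it is safer to request the units explicitly. A minimal sketch with invented dates (the vectors `first_seen` and `last_seen` are assumptions, not taken from the data):

```r
# Invented first and last download dates for two packages (UTC avoids DST shifts).
first_seen <- as.POSIXct(c("2015-01-01", "2016-06-01"), tz = "UTC")
last_seen  <- as.POSIXct(c("2015-01-02", "2017-06-01"), tz = "UTC")

# Ask for explicit units instead of relying on the automatic choice.
hours <- as.numeric(difftime(last_seen, first_seen, units = "hours"))
days  <- as.numeric(difftime(last_seen, first_seen, units = "days"))

print(hours)
print(days)
```

Requesting the units keeps the axis label of a histogram in agreement with the values that are plotted.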

4 Packages downloads

4.1 Ratio downloads per IP

We can start by comparing the number of downloads (different from 0) with how many IPs download each package.

ggplot(stats, aes(Nb_of_distinct_IPs, Nb_of_downloads, col = Package)) + 
  geom_point() + 
  theme_bw + 
  geom_smooth(method = "lm") + 
  xlab("Number of distinct IPs") + 
  ylab("Number of downloads") + 
  ggtitle("Downloads by different IP") +
  geom_abline(slope = 2) + 
  guides(col = FALSE)

Figure 6: Downloads and distinct IPs of all months and packages
Each color is a package, the black line represents 2 downloads per IP.

Not surprisingly, most of the packages have two downloads from the same IP, one for each Bioconductor release (black line). However, there are some packages that a few IPs download many times, which may indicate that these packages are mostly installed in a few locations.

ratio <- stats[, .(slope = coef(lm(Nb_of_downloads~Nb_of_distinct_IPs))[2]), by = Package]
ratio <- ratio[order(slope, decreasing = TRUE), ]
ratio <- ratio[!is.na(slope), ]
ratio$Package <- as.character(ratio$Package)
ratio
##                          Package      slope
##   1:             breastCancerNKI  4.8135334
##   2:                       DLBCL  4.2674435
##   3:                    Neve2006  4.1159367
##   4:               bronchialIL13  3.6021538
##   5: Illumina450ProbeVariants.db  3.2879595
##   6:                 parathyroid  3.2434080
##   7:                   facsDorit  3.2349722
##   8:                 DeSousa2013  3.1993847
##   9:                    geuvPack  3.0943912
##  10:        beadarrayExampleData  2.9841274
##  ---                                       
## 321:        prostateCancerGrasso  1.0261959
## 322:              Affymoe430Expr  1.0217286
## 323:                     ESNSTCC  1.0000000
## 324:        prostateCancerTaylor  0.9683099
## 325:  Single.mTEC.Transcriptomes  0.9637681
## 326:                   RITANdata  0.8461538
## 327:                 biotmleData  0.6381418
## 328:                 SVM2CRMdata  0.5399736
## 329:              M3DExampleData  0.4134078
## 330:                   MIGSAdata -0.3488372

We can see that the package with the most downloads from the same IP is breastCancerNKI, followed by DLBCL, Neve2006 and, in fourth place, bronchialIL13. At the moment I last edited this manually, the first one was for ChIP-seq, the second one for flow cytometry, and the third and fourth ones for chromatographically separated and single-spectra mass spectral data; maybe few locations use these packages.
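The slope used to rank the packages is the coefficient of a linear model of downloads against distinct IPs, fitted separately for each package. As an illustration only, here is a base R sketch on an invented table (`toy`, `IPs` and `Downloads` are hypothetical names and values):

```r
# Invented monthly records for two packages.
toy <- data.frame(
  Package   = rep(c("pkgA", "pkgB"), each = 3),
  IPs       = c(1, 2, 3, 1, 2, 3),
  Downloads = c(2, 4, 6, 5, 10, 15)
)

# Slope of Downloads ~ IPs for each package: downloads gained per extra IP.
slopes <- sapply(split(toy, toy$Package), function(d) {
  unname(coef(lm(Downloads ~ IPs, data = d))[2])
})

print(sort(slopes, decreasing = TRUE))
```

A slope near 2 would match the black line of Figure 6, while a larger slope suggests repeated downloads from the same IPs.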

I am curious about how the default packages of Bioconductor are downloaded. Let’s see where they are:

ratio[Package %in% bioc_packages, ]
## Empty data.table (0 rows) of 2 cols: Package,slope

None of the default packages appears among the Experimental packages, so the result is empty.

Now we explore whether there are seasonal cycles in the downloads, as in figure ?? there seem to be some cycles.

4.2 By date

First we can explore the number of IPs per month downloading each package:

ggplot(stats, aes(Date, Nb_of_distinct_IPs, col = Package)) + 
  geom_line() + 
  theme_bw +
  ggtitle("IPs") +
  ylab("Distinct IP downloads") +
  scal +
  theme(axis.text.x=element_text(angle=60, hjust=1)) + 
  guides(col = FALSE)

Figure 7: Distinct IP per package

As we can see, in 2009 there are two groups of packages: some with a low number of IPs and some with a larger number. As time progresses, the number of distinct IPs increases for some packages. But is the spread in IPs associated with an increase in downloads?

ggplot(stats, aes(Date, Nb_of_downloads, col = Package)) + 
  geom_line() + 
  theme_bw +
  ggtitle("Downloads per package") +
  ylab("Downloads") +
  scal +
  theme(axis.text.x=element_text(angle=60, hjust=1)) + 
  guides(col = FALSE)

Figure 8: Downloads per year

Surprisingly, some packages have a big burst of up to 400k downloads, others of just 100k downloads. But let’s focus on the lower end:

ggplot(stats, aes(Date, Nb_of_downloads, col = Package)) + 
  geom_line() + 
  theme_bw +
  ggtitle("Downloads per package every three months") +
  ylab("Downloads") +
  scal +
  ylim(0, 50000)+
  theme(axis.text.x=element_text(angle=60, hjust=1)) + 
  guides(col = FALSE)

Figure 9: Downloads per year

There are many packages close to 0 downloads each month, and most packages have fewer than 10000 downloads per month:

ggplot(stats, aes(Date, Nb_of_downloads, col = Package)) + 
  geom_line() + 
  theme_bw+
  ggtitle("Downloads per package every three months") +
  ylab("Downloads") +
  scal +
  ylim(0, 10000)+
  theme(axis.text.x=element_text(angle=60, hjust=1)) + 
  guides(col = FALSE)

Figure 10: Downloads per year

As we can see, in general the month of the year also influences the number of downloads. So, from 2010 onwards, the factors influencing the downloads are the year and the month.
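One simple way to look at the effect of the month is to average the downloads per calendar month, pooling all years. The sketch below uses base R on invented numbers (the `toy` table is an assumption, not the real data):

```r
# Invented records: two years of downloads for three months.
toy <- data.frame(
  Year      = rep(c(2015, 2016), each = 3),
  Month     = rep(c("01", "06", "12"), times = 2),
  Downloads = c(10, 30, 5, 20, 40, 15)
)

# Mean downloads per calendar month, pooling all years.
by_month <- aggregate(Downloads ~ Month, data = toy, FUN = mean)
print(by_month)
```

If some months are consistently higher than others across years, that would support a seasonal component in the downloads.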

Maybe there is a relationship between the downloads and the number of IPs per date:

ggplot(stats, aes(Date, Nb_of_downloads/Nb_of_distinct_IPs, col = Package)) + 
  geom_line() + 
  theme_bw +
  ggtitle("IPs") +
  ylab("Ratio") +
  scal +
  theme(axis.text.x=element_text(angle=60, hjust=1)) + 
  guides(col = FALSE)

Figure 11: Ratio downloads per IP per package

We can see that some packages have occasional rises in downloads per IP. But at this scale we miss a lot of packages with small ratios, so we zoom in:

ggplot(stats, aes(Date, Nb_of_downloads/Nb_of_distinct_IPs, col = Package)) + 
  geom_line() + 
  theme_bw +
  ggtitle("IPs") +
  ylab("Ratio") +
  scal +
  theme(axis.text.x=element_text(angle=60, hjust=1)) + 
  guides(col = FALSE) +
  ylim(1, 5)

Figure 12: Ratio downloads per IP per package

But most of the packages seem to be more or less constant and around 2.

5 Models

One problem when comparing the evolution of the packages is that they started at different moments and, as seen, the number of downloads has been increasing with time, as has the number of packages. So we need to normalize the starting dates:

norm <- stats[, .(Norm = as.numeric(Date)/as.numeric(max(Date)), 
                   Downloads = Nb_of_downloads/max(Nb_of_downloads)), by = Package]
ggplot(norm, aes(Norm, Downloads, col = Package)) + 
  geom_line() + 
  theme_bw() + 
  ggtitle("Downloads per stage of the package") +
  xlab("Date normalized") + 
  guides(col = FALSE)

Figure 13: Normalization of dates and downloads

We can observe a tendency for the number of downloads to decrease after a package is included in Bioconductor and to rise again later.
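The code above divides each date by the latest date of the package. If `Date` is stored as a date-time, its numeric value is the number of seconds since 1970, so this ratio stays close to 1 for every package. One possible alternative, shown here only as a sketch, is a min-max rescaling within each package, so that every package starts at 0 and ends at 1 (the helper `rescale01` and the `toy` table are hypothetical):

```r
# Hypothetical helper: rescale a numeric vector to the range [0, 1].
rescale01 <- function(x) {
  rng <- range(x)
  if (rng[1] == rng[2]) return(rep(0, length(x)))  # single time point: no span
  (x - rng[1]) / (rng[2] - rng[1])
}

# Invented time points for two packages that start at different moments.
toy <- data.frame(
  Package = c("old", "old", "old", "new", "new"),
  Time    = c(0, 50, 100, 90, 100)
)

# Rescale within each package so that every package starts at 0 and ends at 1.
toy$Norm <- ave(toy$Time, toy$Package, FUN = rescale01)
print(toy)
```

With this rescaling, an old and a new package share the same horizontal range, which makes their trajectories easier to compare.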

SessionInfo

sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.2 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] data.table_1.10.4 ggplot2_2.2.1     BiocStyle_2.4.0  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.10     knitr_1.15.1     magrittr_1.5     munsell_0.4.3   
##  [5] colorspace_1.3-2 stringr_1.2.0    highr_0.6        plyr_1.8.4      
##  [9] tools_3.4.0      grid_3.4.0       gtable_0.2.0     htmltools_0.3.6 
## [13] yaml_2.1.14      lazyeval_0.2.0   rprojroot_1.2    digest_0.6.12   
## [17] tibble_1.3.0     bookdown_0.3     evaluate_0.10    rmarkdown_1.5   
## [21] labeling_0.3     stringi_1.1.5    compiler_3.4.0   scales_0.4.1    
## [25] backports_1.0.5